Enrich DurabilityAgent.CheckHealthAsync with persistence-layer signals (#2646) by jeremydmiller · Pull Request #2649 · JasperFx/wolverine

jeremydmiller · 2026-05-01T12:59:41Z

Closes #2646.

Summary

The three durability agents — Wolverine.RDBMS.DurabilityAgent, RavenDbDurabilityAgent, CosmosDbDurabilityAgent — all relied on the default IAgent.CheckHealthAsync: Status == Running ? Healthy : Unhealthy. That hides the cases monitoring tools (CritterWatch's Agents tab) actually need to flag: a healthy-looking agent silently failing to reach the store, the dead-letter queue ballooning because handlers are dying, or a recovery loop that's not draining a stuck batch.

This PR threads three new persistence signals through every durability agent:

1. Persistence reachability: each agent's poll loop wraps its tick in try/catch and feeds the outcome into a per-agent DurabilityHealthSignals instance. CheckHealthAsync also pings the store via FetchCountsAsync. One failed cycle ⇒ Degraded with the underlying error message; N consecutive failures (default 3, DurabilitySettings.HealthConsecutiveFailureUnhealthyThreshold) ⇒ Unhealthy.
2. Dead-letter queue growth: between consecutive evaluations, compare the PersistedCounts.DeadLetter delta against DurabilitySettings.HealthDeadLetterGrowthPerMinuteThreshold (default 100/min). Above threshold ⇒ Degraded with the rate in the description.
3. Stuck recovery / scheduled-job pollers: if the persisted inbox+outbox total (or the scheduled count) stays non-zero and never decreases across DurabilitySettings.HealthStuckPollCycleThreshold consecutive evaluations (default 3) ⇒ Degraded. Catches the "single bad envelope blocks the queue" case the issue calls out.
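The two count-based signals come down to arithmetic over consecutive snapshots. The sketch below is a rough Python model of that arithmetic, not the C# in DurabilityHealthSignals. `Counts`, `CountSignals`, the field names, and the description strings are all hypothetical; the defaults mirror the thresholds listed above. Whether the first evaluation counts toward the stuck streak, and whether the rate comparison is strict, are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Counts:
    """Hypothetical stand-in for a PersistedCounts snapshot."""
    incoming: int = 0
    outgoing: int = 0
    scheduled: int = 0
    dead_letter: int = 0


class CountSignals:
    """Illustrative model of the DLQ-growth and stuck-poller checks."""

    def __init__(self, dlq_per_minute_threshold: float = 100.0,
                 stuck_cycle_threshold: int = 3) -> None:
        self.dlq_per_minute_threshold = dlq_per_minute_threshold
        self.stuck_cycle_threshold = stuck_cycle_threshold
        self._previous: Optional[Counts] = None
        self._previous_time: Optional[float] = None
        self._stuck_recovery = 0
        self._stuck_scheduled = 0

    @staticmethod
    def _next_streak(streak: int, before: Optional[int], after: int) -> int:
        # The streak grows only while the count is non-zero and has not
        # decreased; any decrease (or an empty queue) resets it.
        if after > 0 and (before is None or after >= before):
            return streak + 1
        return 0

    def evaluate(self, counts: Counts, now_seconds: float) -> List[str]:
        """Return one description per Degraded signal (empty means none)."""
        problems: List[str] = []
        prev, prev_time = self._previous, self._previous_time

        # Dead-letter growth: delta since the last evaluation, per minute.
        if prev is not None and prev_time is not None:
            elapsed_minutes = (now_seconds - prev_time) / 60.0
            if elapsed_minutes > 0:
                rate = (counts.dead_letter - prev.dead_letter) / elapsed_minutes
                if rate > self.dlq_per_minute_threshold:
                    problems.append(f"dead letters growing at {rate:.0f}/min")

        # Stuck pollers: inbox+outbox total and scheduled count tracked separately.
        self._stuck_recovery = self._next_streak(
            self._stuck_recovery,
            None if prev is None else prev.incoming + prev.outgoing,
            counts.incoming + counts.outgoing)
        self._stuck_scheduled = self._next_streak(
            self._stuck_scheduled,
            None if prev is None else prev.scheduled,
            counts.scheduled)
        if self._stuck_recovery >= self.stuck_cycle_threshold:
            problems.append("recovery appears stuck")
        if self._stuck_scheduled >= self.stuck_cycle_threshold:
            problems.append("scheduled jobs appear stuck")

        self._previous, self._previous_time = counts, now_seconds
        return problems
```

In this model, a single decrease in the inbox+outbox total is enough to clear the stuck streak, so a slowly draining backlog is not flagged.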

Status precedence: a non-Running status always returns Unhealthy first; then the consecutive-failure Unhealthy; then the worst aggregated Degraded. Multiple Degraded signals are joined into a single ;-separated description so operators see the full picture in one tooltip.
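The precedence rules can be read as a short decision ladder. Here is a minimal Python sketch of that ladder, offered as an illustration and not as the code in the PR. The function name, parameters, and message wording are hypothetical, and the exact separator (`"; "` here) is an assumption based on the ";-separated" wording.

```python
from enum import Enum
from typing import List, Optional, Tuple


class Health(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


def resolve_health(is_running: bool,
                   consecutive_failures: int,
                   last_error: Optional[str],
                   degraded_reasons: List[str],
                   unhealthy_threshold: int = 3) -> Tuple[Health, str]:
    """Apply the precedence: not running, then failure streak, then Degraded."""
    # 1. A non-Running agent is always Unhealthy, whatever else is going on.
    if not is_running:
        return Health.UNHEALTHY, "agent is not running"

    # 2. Enough consecutive persistence failures escalate to Unhealthy.
    if consecutive_failures >= unhealthy_threshold:
        return (Health.UNHEALTHY,
                f"{consecutive_failures} consecutive persistence failures: {last_error}")

    # 3. Otherwise collect every Degraded reason into one description.
    reasons = list(degraded_reasons)
    if consecutive_failures > 0:
        reasons.insert(0, f"last poll failed: {last_error}")
    if reasons:
        return Health.DEGRADED, "; ".join(reasons)

    return Health.HEALTHY, ""
```

For example, one failed poll combined with a stuck recovery loop yields a single Degraded result whose description lists both reasons.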

DurabilityHealthSignals is intentionally public so per-store agents from the RavenDb / CosmosDb assemblies (which do not have InternalsVisibleTo into Wolverine) can use it directly. The class is deliberately small: shared mutable state, RecordPollSuccess / RecordPollFailure mutators, and a single Evaluate() that takes the current PersistedCounts snapshot.

Files

- src/Wolverine/Persistence/Durability/DurabilityHealthSignals.cs (new): the shared evaluator.
- src/Wolverine/DurabilitySettings.cs: three new threshold properties (defaults: 100/min DLQ growth, 3 stuck cycles, 3 consecutive failures).
- src/Persistence/Wolverine.RDBMS/DurabilityAgent.cs: replaces the existing _successCount / _exceptionCount rolling logic with the shared signals; adds count-based signals.
- src/Persistence/Wolverine.RavenDb/Internals/Durability/RavenDbDurabilityAgent.cs: adds a CheckHealthAsync override; wraps each recovery + scheduled-job tick in try/catch to feed the signals.
- src/Persistence/Wolverine.CosmosDb/Internals/Durability/CosmosDbDurabilityAgent.cs: same shape as RavenDb.
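The "wrap each tick in try/catch" pattern is sketched below in Python as a rough model of the idea; the agents themselves are C#. `PollOutcomeTracker` and `run_tick` are hypothetical names. Only the method names echo RecordPollSuccess / RecordPollFailure from the description, and resetting the streak on success is an assumption implied by "consecutive".

```python
from typing import Callable, Optional


class PollOutcomeTracker:
    """Illustrative holder for the reachability state fed by the poll loop."""

    def __init__(self) -> None:
        self.consecutive_failures = 0
        self.last_error: Optional[str] = None

    def record_poll_success(self) -> None:
        # Assumed behaviour: any successful tick clears the failure streak.
        self.consecutive_failures = 0
        self.last_error = None

    def record_poll_failure(self, error: Exception) -> None:
        self.consecutive_failures += 1
        self.last_error = str(error)


def run_tick(tick: Callable[[], None], tracker: PollOutcomeTracker) -> bool:
    """Run one poll tick; record the outcome instead of letting it escape."""
    try:
        tick()
    except Exception as error:  # the loop must survive a failing store call
        tracker.record_poll_failure(error)
        return False
    tracker.record_poll_success()
    return True
```

The health check can then read `consecutive_failures` and `last_error` without touching the poll loop itself.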

Test plan

- New CoreTests/Persistence/durability_health_signals_tests covers the helper in isolation: status precedence, single-failure Degraded, threshold-based Unhealthy, DLQ growth above + below threshold, stuck-recovery + stuck-scheduled with reset behaviour, multi-signal aggregation, and the diagnostic counter accessor. 12/12 green.
- Full CoreTests suite green: Failed: 0, Passed: 1421, Total: 1421, Duration: 3m 53s.

🤖 Generated with Claude Code

…tence signals (#2646)

The three durability agents (Wolverine.RDBMS, Wolverine.RavenDb, Wolverine.CosmosDb) all relied on the default IAgent.CheckHealthAsync: Status==Running ? Healthy : Unhealthy. That hides the cases monitoring tools (CritterWatch's Agents tab) actually need to flag: a healthy-looking agent silently failing to reach the store, the DLQ ballooning because handlers are dying, or a recovery loop that's not draining a stuck batch.

This commit threads three new persistence signals through every durability agent:

1. **Persistence reachability**: each agent's poll loop now wraps its tick in a try/catch and feeds the outcome into a per-agent `DurabilityHealthSignals` instance. CheckHealthAsync also pings the store via FetchCountsAsync. One failed cycle ⇒ Degraded with the underlying error message; N consecutive failures (default 3, `DurabilitySettings.HealthConsecutiveFailureUnhealthyThreshold`) ⇒ Unhealthy.
2. **Dead-letter queue growth**: between consecutive evaluations, compare the `PersistedCounts.DeadLetter` delta against `DurabilitySettings.HealthDeadLetterGrowthPerMinuteThreshold` (default 100/min). Above threshold ⇒ Degraded with the rate in the description.
3. **Stuck recovery / scheduled-job pollers**: if the persisted inbox+outbox total (or scheduled count) stays non-zero and never decreases across `DurabilitySettings.HealthStuckPollCycleThreshold` consecutive evaluations (default 3) ⇒ Degraded. Catches the "single bad envelope blocks the queue" case the issue calls out.

Status precedence: a non-Running status always returns Unhealthy first; then the consecutive-failure Unhealthy; then the worst aggregated Degraded. Multiple Degraded signals are joined into a single `;`-separated description so operators see the full picture in one tooltip.

`DurabilityHealthSignals` is intentionally public so per-store agents from the RavenDb / CosmosDb assemblies (which do not have InternalsVisibleTo into Wolverine) can use it directly. The class is deliberately small: shared mutable state, RecordPollSuccess/Failure mutators, and a single Evaluate() that takes the current PersistedCounts snapshot.

Test plan:

- New CoreTests/Persistence/durability_health_signals_tests covers the helper in isolation: status precedence, single-failure Degraded, threshold-based Unhealthy, DLQ growth above + below threshold, stuck-recovery + stuck-scheduled with reset behaviour, multi-signal aggregation, and the diagnostic counter accessor. 12/12 green.
- Full CoreTests suite green: Failed: 0, Passed: 1421, Total: 1421.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jeremydmiller merged commit 64b31ea into main May 1, 2026
19 of 21 checks passed

This was referenced May 2, 2026

Bump WolverineFx from 5.24.0 to 5.36.0 RobertLynJA/Blog2#702

Open

Bump Scalar.AspNetCore and WolverineFx zribktad/API-Template#114

Closed

jeremydmiller merged 1 commit into main from 2646-enrich-durability-health
